NLSY Parent Income and Wealth Imputation

Data Summary

Table 1: Data summary, continuous variables
Variable       Unique (#)  Missing (%)      Mean        SD        Min   Median       Max
age_resp                5            0      14.0       1.4       12.0     14.0      16.0
bdate_y                 5            0    1982.0       1.4     1980.0   1982.0    1984.0
pincome              2234           27   46483.5   42112.7        0.0  37900.0  246474.0
pnetworth            2507           26   90429.7  137578.8  −935251.0  34500.0  600000.0
retsav1               228           75   44237.2   63017.9        0.0  20000.0  300000.0
mom_age_birth          31            7      25.5       5.4       16.0     25.0      45.0
mom_age                35            7      39.5       5.5       28.0     39.0      61.0
has_retsav              3           13       0.5       0.5        0.0      0.0       1.0
owns_home               3           18       0.6       0.5        0.0      1.0       1.0
both_parents            2            0       0.5       0.5        0.0      0.0       1.0

Table 2: Data summary, categorical variables
Variable     Level                       N     %
mom_educ_hs  Less than high school    1779  20.0
             High-school graduate     2860  32.1
             Some college             1877  21.1
             College degree           1057  11.9
             Graduate degree           284   3.2
dad_educ_hs  Less than high school    1142  12.8
             High-school graduate     1913  21.5
             Some college             1224  13.8
             College degree            894  10.0
             Graduate degree           338   3.8
race_eth     AAPI Hispanic               4   0.0
             AAPI NonHispanic          156   1.8
             AIAN Hispanic              18   0.2
             AIAN NonHispanic           43   0.5
             Black Hispanic             54   0.6
             Black NonHispanic        2333  26.2
             Other Race Hispanic       944  10.6
             Other Race NonHispanic    119   1.3
             White Hispanic            819   9.2
             White NonHispanic        4406  49.5

Imputation of Parental Income and Net Worth

Choices

From the mice package documentation:

  1. First, we should decide whether the missing at random (MAR) assumption (Rubin 1976) is plausible. The MAR assumption is a suitable starting point in many practical cases, but there are also cases where the assumption is suspect. Schafer (1997, pp. 20–23) provides a good set of practical examples. MICE can handle both MAR and missing not at random (MNAR). Multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations. There are many ways to do this. We refer to Section 6.2 for an example of how that could be realized.
  2. The second choice refers to the form of the imputation model. The form encompasses both the structural part and the assumed error distribution. Within MICE the form needs to be specified for each incomplete column in the data. The choice will be steered by the scale of the dependent variable (i.e., the variable to be imputed), and preferably incorporates knowledge about the relation between the variables. Section 3.2 describes the possibilities within mice 2.9.
  3. Our third choice concerns the set of variables to include as predictors into the imputation model. The general advice is to include as many relevant variables as possible including their interactions (Collins et al. 2001). This may however lead to unwieldy model specifications that could easily get out of hand. Section 3.3 describes the facilities within mice 2.9 for selecting the predictor set.
  4. The fourth choice is whether we should impute variables that are functions of other (incomplete) variables. Many data sets contain transformed variables, sum scores, interaction variables, ratios, and so on. It can be useful to incorporate the transformed variables into the multiple imputation algorithm. Section 3.4 describes how mice 2.9 deals with this situation using passive imputation.
  5. The fifth choice concerns the order in which variables should be imputed. Several strategies are possible, each with their respective pros and cons. Section 3.6 shows how the visitation scheme of the MICE algorithm within mice 2.9 is under control of the user.
  6. The sixth choice concerns the setup of the starting imputations and the number of iterations. The convergence of the MICE algorithm can be monitored in many ways. Section 4.3 outlines some techniques in mice 2.9 that assist in this task.
  7. The seventh choice is m, the number of multiply imputed data sets. Setting m too low may result in large simulation error, especially if the fraction of missing information is high.
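
The seventh choice can be made concrete with Rubin's rules for pooling results across the m imputed data sets. The following is a minimal Python sketch, not the code used for this report; the estimates and variances are hypothetical numbers chosen for illustration. The total variance adds the between-imputation variance inflated by a factor (1 + 1/m), which is why a small m increases simulation error.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t

# Hypothetical estimates of one quantity from m = 5 imputed data sets.
est = [46.0, 47.5, 45.5, 46.5, 47.0]
var = [1.0, 1.1, 0.9, 1.0, 1.0]
q_bar, t = pool_rubin(est, var)
print(q_bar, t)
```

With these illustrative inputs the between-imputation component accounts for a sizeable share of the total variance, which is the situation in which a larger m helps most.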

Significance of Predictors

Table 3: Regressions of pincome and pnetworth on the candidate predictors, to examine predictor significance
Term                            pincome         pnetworth
(Intercept)                     67843.438***    −1589.162
mom_age                         765.726***      4276.303***
race_ethAAPI NonHispanic        −45621.940**    −57439.825
race_ethAIAN Hispanic           −71898.806*     −135081.226
race_ethAIAN NonHispanic        −50782.647**    −134459.990*
race_ethBlack Hispanic          −66498.424***   −169547.969**
race_ethBlack NonHispanic       −54745.115***   −152405.027**
race_ethOther Race Hispanic     −52600.891***   −118811.326
race_ethOther Race NonHispanic  −56368.568***   −105934.771
race_ethWhite Hispanic          −55943.174***   −144242.355**
race_ethWhite NonHispanic       −47277.084**    −99882.141
mom_educ_hs.L                   22019.499***    71591.805***
mom_educ_hs.Q                   6059.871***     23474.098***
mom_educ_hs.C                   1107.549        −615.634
mom_educ_hs^4                   −2121.900       −14018.565***
dad_educ_hs.L                   26076.534***    60049.340***
dad_educ_hs.Q                   1673.669        12633.649*
dad_educ_hs.C                   1146.181        7439.006
dad_educ_hs^4                   −357.175        −4198.144
has_retsav                      16538.399***    44285.502***
owns_home                       12387.903***    68008.452***
both_parents                    −1833.451       21864.080***
par_dec                         −2769.377       14312.449
Num.Obs.                        2973            2973
R2                              0.324           0.314
R2 Adj.                         0.319           0.309
AIC                             70546.8         78287.6
BIC                             70690.7         78431.5
Log.Lik.                        −35249.385      −39119.783
F                               64.307          61.428
RMSE                            34117.48        125418.88

Checking for multicollinearity: correlation matrix for numeric variables

                 age_resp   mom_age  has_retsav  owns_home both_parents
age_resp      1.000000000 0.2115985 0.007221412 0.02020203   -0.0336063
mom_age       0.211598517 1.0000000 0.155942508 0.20824653    0.2347601
has_retsav    0.007221412 0.1559425 1.000000000 0.41849655    0.2698076
owns_home     0.020202029 0.2082465 0.418496546 1.00000000    0.3737127
both_parents -0.033606299 0.2347601 0.269807569 0.37371271    1.0000000

Checking for multicollinearity: VIF and successive addition of regressors
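
The variance inflation factor for predictor j is 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. The report does not show its VIF output, so the following Python sketch is only an illustration, not the author's computation. It uses the pairwise correlations from the matrix above (rounded to three decimals); with a single other predictor, R_j² reduces to the squared correlation, so each value is a lower bound on the full VIF.

```python
# Pairwise correlations copied from the correlation matrix (rounded).
corr = {
    ("has_retsav", "owns_home"): 0.418,
    ("owns_home", "both_parents"): 0.374,
    ("has_retsav", "both_parents"): 0.270,
    ("mom_age", "both_parents"): 0.235,
}

def pairwise_vif(r):
    """VIF when a predictor is regressed on one other predictor: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)

vifs = {pair: round(pairwise_vif(r), 3) for pair, r in corr.items()}
for pair, value in vifs.items():
    print(pair, value)
```

The largest pairwise value, for has_retsav and owns_home, is about 1.21, so none of these pairs on its own points to a collinearity problem.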

Checking for missing values in the predictor variables

          id      pincome    pnetworth      mom_age  mom_educ_hs  dad_educ_hs 
           0         2361         2327          593         1039         3385 
    race_eth   has_retsav    owns_home both_parents 
           0         1158         1583            0 
              id pincome pnetworth mom_age mom_educ_hs dad_educ_hs race_eth
id           NaN     NaN       NaN     NaN         NaN         NaN      NaN
pincome        1   0.000     0.330   0.903       0.839       0.579        1
pnetworth      1   0.320     0.000   0.903       0.837       0.651        1
mom_age        1   0.614     0.619   0.000       0.671       0.540        1
mom_educ_hs    1   0.633     0.635   0.812       0.000       0.363        1
dad_educ_hs    1   0.706     0.760   0.919       0.804       0.000        1
race_eth     NaN     NaN       NaN     NaN         NaN         NaN      NaN
has_retsav     1   0.068     0.000   0.872       0.788       0.603        1
owns_home      1   0.340     0.303   0.896       0.817       0.587        1
both_parents NaN     NaN       NaN     NaN         NaN         NaN      NaN
             has_retsav owns_home both_parents
id                  NaN       NaN          NaN
pincome           0.543     0.557            1
pnetworth         0.502     0.526            1
mom_age           0.750     0.722            1
mom_educ_hs       0.763     0.721            1
dad_educ_hs       0.864     0.807            1
race_eth            NaN       NaN          NaN
has_retsav        0.000     0.105            1
owns_home         0.346     0.000            1
both_parents        NaN       NaN          NaN
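
The missing-value counts above can be tied back to the percentages in Table 1. The Python sketch below is a consistency check rather than part of the original analysis, and it assumes that the sample size equals the sum of the race_eth counts in Table 2 (race_eth has no missing values).

```python
# Missing-value counts copied from the output above.
missing = {
    "pincome": 2361, "pnetworth": 2327, "mom_age": 593,
    "mom_educ_hs": 1039, "dad_educ_hs": 3385,
    "has_retsav": 1158, "owns_home": 1583,
}

# Assumption: total sample size = sum of the race_eth counts in Table 2.
race_counts = [4, 156, 18, 43, 54, 2333, 944, 119, 819, 4406]
n = sum(race_counts)

# Percentage of missing values per variable, rounded to whole percent.
pct_missing = {name: round(100 * count / n) for name, count in missing.items()}
for name, pct in pct_missing.items():
    print(f"{name:12s} {pct:3d}%")
```

Under that assumption the rounded percentages for pincome, pnetworth, mom_age, has_retsav, and owns_home agree with the Missing (%) column of Table 1.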

Checking for MCAR

The two pairs of boxplots on the edges of Figure 1 show the distributions of pincome and pnetworth when the other variable is observed and when it is missing. pnetworth has a somewhat lower mean in the subsample where pincome is missing than in the subsample where pincome is observed. The distribution of pincome in the subsample where pnetworth is missing is shifted upward relative to its distribution in the subsample where pnetworth is observed. In each case the two distributions are relatively similar, so the MCAR assumption is not implausible.
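
The comparison behind the boxplots can be sketched numerically: split one variable by whether the other is missing and compare the two groups. The Python example below uses a few hypothetical toy records, not the NLSY data, and the function name is an illustrative choice.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"pincome": 30000, "pnetworth": 20000},
    {"pincome": None,  "pnetworth": 15000},
    {"pincome": 52000, "pnetworth": 90000},
    {"pincome": None,  "pnetworth": 40000},
    {"pincome": 41000, "pnetworth": None},
    {"pincome": 75000, "pnetworth": 150000},
]

def mean_by_missingness(rows, value, indicator):
    """Mean of `value`, split by whether `indicator` is missing."""
    groups = {True: [], False: []}
    for r in rows:
        if r[value] is not None:
            groups[r[indicator] is None].append(r[value])
    return {("missing" if k else "observed"): mean(v)
            for k, v in groups.items() if v}

result = mean_by_missingness(rows, "pnetworth", "pincome")
print(result)
```

Under MCAR the two group summaries should look alike; a large gap suggests that missingness depends on observed values, which argues against MCAR.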

Figure 1: Marginal plot for parental income and wealth

Multiple Imputation of Parental Income and Net Worth

Table 4: Imputation-methods vector
Variable      Method
id
pincome       pmm
pnetworth     pmm
mom_age       pmm
mom_educ_hs   polr
dad_educ_hs   polr
race_eth
has_retsav    logreg
owns_home     logreg
both_parents
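
The pmm entries in the methods vector stand for predictive mean matching. The Python sketch below illustrates the core idea with a single predictor and hypothetical data: fit a regression on the observed cases, find the observed cases whose predicted values are closest to the prediction for a missing case, and borrow the observed value of one randomly chosen donor. It is a simplification and not the mice implementation; in particular it omits the random draw of regression parameters that a proper multiple-imputation procedure includes, and the function name and the choice of k = 3 donors are assumptions for illustration.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by predictive mean matching on one predictor x."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mx = sum(xi for xi, _ in obs) / n
    my = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mx) ** 2 for xi, _ in obs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in obs) / sxx
    intercept = my - slope * mx

    def predict(xi):
        return intercept + slope * xi

    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)          # observed values are kept as they are
            continue
        target = predict(xi)
        # The k observed cases whose predicted values are closest to the target.
        donors = sorted(obs, key=lambda d: abs(predict(d[0]) - target))[:k]
        filled.append(rng.choice(donors)[1])   # borrow a donor's observed value
    return filled

# Hypothetical data: x could stand for mom_age and y for pincome.
x = [28, 31, 35, 39, 42, 47, 50, 33]
y = [21000, 30000, None, 41000, 52000, None, 70000, 34000]
imputed = pmm_impute(x, y)
print(imputed)
```

Because every imputed value is copied from an observed case, the imputations stay within the range of the observed data, which is useful for skewed variables such as income and net worth.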

Table 5: Imputation predictor matrix
              id  pincome  pnetworth  mom_age  mom_educ_hs  dad_educ_hs  race_eth  has_retsav  owns_home  both_parents
id             0        1          1        1            1            1         1           1          1             1
pincome        0        0          1        1            1            1         1           1          1             1
pnetworth      0        1          0        1            1            1         1           1          1             1
mom_age        0        1          1        0            1            1         1           1          1             1
mom_educ_hs    0        1          1        1            0            1         1           1          1             1
dad_educ_hs    0        1          1        1            1            0         1           1          1             1
race_eth       0        1          1        1            1            1         1           1          1             1
has_retsav     0        1          1        1            1            1         1           0          1             1
owns_home      0        1          1        1            1            1         1           1          0             1
both_parents   0        1          1        1            1            1         1           1          1             0
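
In a mice predictor matrix each row names a variable to be imputed, and a 1 in a column marks that column's variable as a predictor for it. The short Python sketch below reads off the predictor set for a target variable; the three rows are copied from Table 5, and the helper name is hypothetical.

```python
variables = ["id", "pincome", "pnetworth", "mom_age", "mom_educ_hs",
             "dad_educ_hs", "race_eth", "has_retsav", "owns_home", "both_parents"]

# Selected rows of Table 5 (row = variable being imputed).
rows = {
    "pincome":     [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "pnetworth":   [0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    "dad_educ_hs": [0, 1, 1, 1, 1, 0, 1, 1, 1, 1],
}

def predictors_of(target):
    """Names of the variables flagged with 1 in the target's row."""
    return [name for name, flag in zip(variables, rows[target]) if flag == 1]

pincome_predictors = predictors_of("pincome")
print(pincome_predictors)
```

For pincome this returns every variable except id and pincome itself, matching the zero entries in the id column and on the diagonal.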

Distribution Plots for Parent Income (non-imputed and imputed)

Figure 2: Distribution of parental income, original and imputed

Figure 3: Distribution of parental income by race, original and imputed

Distribution Plots for Parent Wealth (non-imputed and imputed)

Figure 4: Distribution of parental net worth, original and imputed

Figure 5: Distribution of parental net worth by race, original and imputed

Distribution Plots for Dad Education (non-imputed and imputed)

Table 6: Total number of observations and number of missing values of father’s education by race and ethnicity
race_eth                    N  dad_educ_miss
AAPI Hispanic               4              1
AAPI NonHispanic          156             37
AIAN Hispanic              18             16
AIAN NonHispanic           43             14
Black Hispanic             54             31
Black NonHispanic        2333           1411
Other Race Hispanic       944            398
Other Race NonHispanic    119             43
White Hispanic            819            303
White NonHispanic        4406           1131
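
The counts in Table 6 are easier to compare as shares. The Python sketch below converts them to the percentage of missing father's education within each group; it is an illustrative calculation on the table's own numbers.

```python
# (N, dad_educ_miss) pairs copied from Table 6.
table6 = {
    "AAPI Hispanic": (4, 1),
    "AAPI NonHispanic": (156, 37),
    "AIAN Hispanic": (18, 16),
    "AIAN NonHispanic": (43, 14),
    "Black Hispanic": (54, 31),
    "Black NonHispanic": (2333, 1411),
    "Other Race Hispanic": (944, 398),
    "Other Race NonHispanic": (119, 43),
    "White Hispanic": (819, 303),
    "White NonHispanic": (4406, 1131),
}

# Percentage of missing dad_educ_hs within each group.
share = {group: round(100 * miss / n, 1) for group, (n, miss) in table6.items()}
total_missing = sum(miss for _, miss in table6.values())

for group, pct in share.items():
    print(f"{group:24s} {pct:5.1f}%")
print("total missing:", total_missing)
```

The total matches the 3385 missing values reported for dad_educ_hs, and the shares differ widely across groups (about 60.5% for Black NonHispanic versus about 25.7% for White NonHispanic), so missingness in father's education is clearly related to race and ethnicity.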

Figure 6: Shares of father’s educational attainment for the original and imputed data

Figure 7: Shares of father’s educational attainment for the original and imputed data by race and ethnicity

Whisker Plots

Scatter Plots for Parent Income and Wealth (imputation 0 is the non-imputed data)

Figure 8: Scatterplots of parental income and networth, original and imputed

Imputation of Retirement Savings

Table 7: Methods vector for the imputation of retirement savings
Variable      Method
id
pincome
pnetworth
mom_age
mom_educ_hs
dad_educ_hs
race_eth
has_retsav
owns_home
both_parents
retsav        pmm

Table 8: Predictor matrix for the imputation of retirement savings
              id  pincome  pnetworth  mom_age  mom_educ_hs  dad_educ_hs  race_eth  has_retsav  owns_home  both_parents  retsav
id             0        1          1        1            1            1         1           0          1             1       1
pincome        0        0          1        1            1            1         1           0          1             1       1
pnetworth      0        1          0        1            1            1         1           0          1             1       1
mom_age        0        1          1        0            1            1         1           0          1             1       1
mom_educ_hs    0        1          1        1            0            1         1           0          1             1       1
dad_educ_hs    0        1          1        1            1            0         1           0          1             1       1
race_eth       0        1          1        1            1            1         1           0          1             1       1
has_retsav     0        1          1        1            1            1         1           0          1             1       1
owns_home      0        1          1        1            1            1         1           0          0             1       1
both_parents   0        1          1        1            1            1         1           0          1             0       1
retsav         0        1          1        1            1            1         1           0          1             1       0

Distribution Plots for Retirement Savings (non-imputed and imputed)

Figure 9: Distribution of parental retirement savings, original and imputed

Figure 10: Distribution of parental retirement savings by race, original and imputed

Figure 11: Scatterplots of parental retirement savings and networth, original and imputed

Figure 12: Scatter plot of parental networth and networth without retirement savings

Figure 13: Distribution of parental networth and networth without retirement savings

Figure 14: Distribution of parental networth and networth without retirement savings by race
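
Figures 12 to 14 refer to net worth without retirement savings, but the report does not state how that variable is built. One plausible construction, assumed here rather than taken from the report, subtracts retirement savings from net worth. The Python sketch below uses a hypothetical helper name and hypothetical values.

```python
def networth_excl_retsav(pnetworth, retsav):
    """Net worth minus retirement savings; None if either input is missing.

    Assumption: the derived variable is a simple difference.
    """
    if pnetworth is None or retsav is None:
        return None
    return pnetworth - retsav

# Hypothetical values for illustration.
examples = [(100000, 20000), (34500, 0), (50000, None)]
derived = [networth_excl_retsav(nw, rs) for nw, rs in examples]
print(derived)
```

If the derived variable is built from imputed values, it would need to be recomputed within each imputed data set so that it stays consistent with its components.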